In this section, we provide a comparative analysis of the performance and energy results of the six designs.
We also discuss the sensitivity of several architectural parameters.
%experimental results associated with our design choices.




%\begin{figure*} [t]
%%\begin{figure*} [t]
%\centering
% \psfig{figure=figures/writebacks.eps, width=4.9in, height=2.5in}
% \caption{\label{fig:writebacks} \scriptsize \bf Number of write backs normalized to M-4MB for PARSEC 2.1 applications}
%%\end{figure*}
%\end{figure*}
%
%\begin{figure*} [t]
%%\begin{figure*} [t]
%\centering
% \psfig{figure=figures/spec-writebacks.eps, width=3.9in, height=2.5in}
% \caption{\label{fig:writebacks-spec} \scriptsize \bf Number of write backs normalized to M-4MB for SPEC multiprogrammed mixes}
%%\end{figure*}
%\end{figure*}


\subsection {Performance Comparison}
\begin{figure*} [t]
\centering
 \psfig{figure=figures/spec-ws.eps, width=0.5\textwidth, height=1.7in}
 \caption{\label{fig:spec-new} Normalized Average Instruction Throughput(IT) and Weighted Speedup(WS) for SPEC 2K6 multiprogrammed mixes. }
\end{figure*}


\begin{figure*} [t]
\centering
 \psfig{figure=figures/parsec-speedup.eps, width=6.9in, height=2.2in}
 \caption{\label{fig:parsec-new} Normalized speedup for PARSEC 2.1 applications. }
\end{figure*}


Figure~\ref{fig:parsec-new} shows speedup results with multithreaded PARSEC 2.1
applications along with the average improvements. Only 9 applications are shown to reduce clutter in the plots, however,
the average numbers are computed across the entire suite. All speedup numbers are normalized to S-1MB.

{\bf Speedup when replacing a SRAM with an ideal STT-RAM cache}: When aggregated across the entire PARSEC suite, we find that this hypothetical design has an average speedup of 23\% over the SRAM cache. This is the maximum performance that any scheme can provide.

{\bf Speedup when replacing a SRAM with 4MB, $10years$ retention time STT-RAM}: We find that, for the M-4MB design, all applications to the right of $x264$ (including $x264$) exhibit speedup improvements over S-1MB. Most of these applications are read intensive applications (details in Table~\ref{table:benchmark}) and thus, they benefit from not only the 4x capacity increase of STT-RAM, but also from the presence of a write buffer for the L2 cache. On an average, we find 6\% improvement in speedup for these applications. Although $ferret$ and $vips$ are write-intensive applications, they benefit with a 4x improvement in capacity when going from a S-1MB to a M-4MB. This is because, the write request to L2 cache banks in these two applications are staggered in time. Thus, a 10-entry write buffer proves to be sufficient to hide the long write latencies of a STT-RAM bank.

All applications to the left of $x264$ are write-intensive and have bursty requests arriving at L2 cache banks. For these applications, we observe significant degradation in speedup (on an average 11\% degradation) because of the high write latency of STT-RAM.
Overall, when averaged across the entire PARSEC suite, a traditional $10years$ STT-RAM gives a minimal 5\% speedup improvement over S-1MB. However, a $10years$ STT-RAM cache organization has 14\% lower performance when compared to the ideal design S-4MB and with write-intensive applications this difference is 21\%. It is this gap that our proposal seeks to bridge by tuning the retention time.

{\bf Speedup when replacing a SRAM with 4MB, $1sec$ retention time STT-RAM}: With such a STT-RAM cache bank, no refreshing schemes are employed. As shown earlier, almost all blocks get refreshed within a $1sec$ time interval. Reducing the retention time from $10years$ to $1sec$ reduced the write latency of a cache bank by 10 cycles (from 22 cycles to 12 cycles). This leads to significant speedup improvements over a $10years$ retention time STT-RAM cache organization. On an average, this reduction in 10 cycles lead to 6\% performance improvement (14\% for write intensive applications). However, this design is still 9\% (11\% for write intensive applications) lower in performance than the ideal case.
Further, when compared to the baseline S-1MB SRAM design, most of the write intensive and bursty application have no speedup gain compared to
Volatile-M-4MB(1sec).

{\bf Speedup when replacing a SRAM with 4MB, $10ms$ retention time STT-RAM without refreshing}: This volatile M-4MB(10ms) design also does not have any refreshing scheme, but the retention time of STT-RAM cells used is $10ms$. Using this STT-RAM device, after $10ms$ the STT-RAM device will lose its data and hence to keep the data integrity forces a large number of write-backs from the last-level cache to the main-memory before the actual expiration happens. Figure~\ref{fig:writebacks} shows number of write backs of all the designs normalized to M-4MB. We observe that the 4MB, $10ms$ retention time STT-RAM design, on an average, has 21\% more write backs than the traditional STT-RAM design.
This leads to significant performance degradation across most applications when compared to simply using a $10years$ retention time STT-RAM cache (8\% performance degradation on average). For instance, in $vips$, there is about 20\% speedup degradation over M-4MB. It is interesting to see the case of $swaptions$, where there is a slight improvement in speedup over M-4MB, although there is an increase in the number of write backs. The reason for this improvement is due to the fact that the majority of blocks that are not refreshed within $10ms$ interval, are not accessed in future as well leading to a low number of read misses. This helps in reaping benefits from the reduced write latency.

{\bf Speedup when replacing a SRAM with 4MB, $10ms$ retention time STT-RAM with refreshing}: This is our proposed scheme (annotated as Revived-M-4MB(10ms) in the figures), which incorporates refreshing of dirty blocks beyond the $10ms$ retention time.
Use of this scheme significantly improves performance when compared to all realistic design scenarios evaluated. On an average, the proposed revived scheme is better than the conventional SRAM design (S-1MB) by 18\%, traditional $10years$ STT-RAM by 15\% and over Volatile M-4MB(1sec)
by 4.5\%. The write latency of this STT-RAM cache bank is 6 cycles and when compared to a $1sec$ retention time STT-RAM, the difference of 6 cycles reduction in L2 cache bank access time helps in improving performance. This performance improvement is in spite of the increase in the number of write-backs (which increase by over 2x) compared to an SRAM cache. $Fluidanimate$ is the only application we found that does not show a gain in performance with this type of STT-RAM device. With $fluidanimate$ application, all the dirty blocks are almost equally distributed among all the ways, and hence, our scheme of considering only the first eight MRU slots is insufficient in reducing many writebacks (as shown in Figure~\ref{fig:writebacks}, with $fluidanimate$, the write-backs become 11x with this volatile STT-RAM).
%Facesim shows the maximum benefit when compared to the volatile STT-RAM (1 sec) cache: 22.7\%

\begin{figure*} [t]
\centering
\begin{tabular}{c}
\psfig{figure=figures/legend-spec.eps, width=5.5in, height=0.15in}
\end{tabular}
\\
\begin{tabular}{cc}
 \psfig{figure=figures/writebacks.eps, width=3.5in, height=1.5in} &
\psfig{figure=figures/spec-writebacks.eps, width=1.5in, height=1.5in} \\
 (a) PARSEC 2.1 Applications  & (b) Average across Multiprogrammed Mixes.
\end{tabular}
 \caption{Number of Write backs normalized to M-4MB.}
\label{fig:writebacks}
\end{figure*}



The Revived-M-4MB(10ms) scheme is closest to the ideal S-4MB case and is within 4\% of it, showing the benefits of our scheme in making the STT-RAM device a choice for universal memory.

{\bf Analysis with SPEC 2K6}: In general, the observations with SPEC benchmarks are consistent with those made with PARSEC 2.1 applications. Figure~\ref{fig:spec-new} shows instruction throughput and weighted speedup with the  SPEC 2K6 multiprogrammed
mixes. Simply replacing a SRAM bank with 4MB STT-RAM would lead to 11\% degradation in instruction throughput, and 4\% degradation in weighted speedup. However, employing our refreshing scheme on a $10ms$ volatile STT-RAM can lead to an average 22\%, 11\%  and 10\% improvement over M-4MB, Volatile M-4MB(1sec) and S-1MB, respectively. With a very write intensive mix ($bzip2$, $gcc$, $lbm$ and $hmmer$) this improvement can be as high as 35\% over the baseline SRAM design. Figure~\ref{fig:writebacks}(b) depicts the normalized write backs for
the SPEC 2.1 benchmarks.




%Figure~\ref{fig:spec-new} shows instruction throughput and weighted speedup for the  SPEC multiprogram
%mixes. We observe that our scheme revived-M-4MB gives 22\% improvement in instruction throughput
%over M-4MB, 11\% improvement over Volatile STT-RAM (1 sec), and 10\% improvement over
%the base line SRAM cache. (although not shown in the Figure due to brevity, in case of the mix of
%bzip2, gcc, lbm, hmmer, the improvement is 15\%). The weighted speedup improvement over M-4MB is 4\%
%and over Volatile STT-RAM(1 sec) is 2\%. (NEED TO EXPLIAN WHY annd more insight).
%The weighted speedup improvement over M-4MB is 4\%
%and over Volatile STT-RAM(1 sec) is 2\%. (NEED TO EXPLIAN WHY annd more insight).

Weighted speedup also follows a similar trend as instruction throughput and we find that our proposed refreshing scheme on a $10ms$ retention time volatile STT-RAM shows the best results. Overall, our proposed design is within 9\% (2\%) of the ideal device in terms of instruction throughput (weighted speedup).

\subsection{Energy Usage Comparison}
\vspace{0.1in}
\begin{figure*} [t]
\centering
\begin{tabular}{c}
\psfig{figure=figures/legend.eps, width=5.5in, height=0.15in}
\end{tabular}
\begin{tabular}{ccc}
 \psfig{figure=figures/leak-eng.eps, width=2.1in, height=1.5in} &
\psfig{figure=figures/dyn-eng.eps, width=2.1in, height=1.5in} &
\psfig{figure=figures/tot-eng.eps, width=2.1in, height=1.5in} \\
 (a) Leakage Energy  & (b) Dynamic Energy & (c) Total Energy
\end{tabular}
 \caption{Energy of applications normalized to that of S-1MB.}
\label{fig:energy}
\end{figure*}

Figure~\ref{fig:energy} shows normalized cache leakage, total of dynamic read and write energy, and total energy for a subset of PARSEC applications. While computing the energy numbers, we take into account the overheads of our proposed cache block refreshing (revived) schemes.
%The number of reads and writes to L2 cache are only considered for the calculation of dynamic energy.
We observe that on an average, there is 44\% improvement in total energy going from S-1MB to
M-4MB designs. This improvement is mainly because of the drastic reduction in leakage energy (43\%).
In general, all volatile STT-RAMs reduce the energy envelope of the LLC. With $1sec$ volatile STT-RAM,
total energy is reduced because of reduction in leakage energy and also nominal performance improvement.
However, when comparing
Volatile M-4MB(10ms) with Volatile M-4MB(1sec), because of larger number of write-backs with $10ms$ retention time STT-RAMs, the performance degrades and thus, leakage energy increases.

With our proposed cache block refreshment scheme, although there is an increase in the dynamic energy on account of additional back and forth writes to the buffer and cache lines, we consistently observe improvement in total energy (due to both reduction in leakage power and improvement in performance). On an average, we find 11\% energy benefits of using revived M-4MB design over Volatile-1sec and 30\% improvement over the baseline STT-RAM design. We observe similar benefits with SPEC 2K6 multiprogrammed applications, but for the sake of brevity we are not showing them.

%DRAM-style refresh scheme proposed in~\cite{STTRAM:HPCA11} has extremely

In conclusion, we find that our proposed revived scheme is significantly better in terms of performance and energy
when compared to a traditional non-volatile STT-RAM and even a volatile STT-RAM with $1sec$ retention time.
Further, the proposed $10ms$ retention time STT-RAM with revival schemes, is very close to the ideal LLC architecture in performance and significantly better in energy conservation.

%significantly improves performance of STT-RAM banks, and bridges the gap between traditional STT-RAMs and the maximum theoretical limit. We observe that our scheme is better in terms of both performance and energy over Volatile-4MB(1sec) and
%M-4MB designs.

%Volatile M-4MB(1sec) leakage benefits over M-4MB correlates with the performance improvement.
%On an average, this design consume more dynamic energy than M-4MB. The dynamic energy fluctuations among
%different applications are on account of changes in number of read and writes. Additional write backs
%triggers read misses which ultimately lead to additional writes to L2 cache. Write energies of 1sec design is more
%than the 10ms designs, which makes the fluctuations depend on the number of reads/writes.
%We see 11\% energy benefits of using revived M-4MB design over Volatile-1sec and 30\% improvement over
%M-4MB designs.The energy numbers of this scheme covers all the overheads of  the buffer design.
%We observe that our scheme is better in terms of both performance and energy over Volatile-4MB(1sec) and
%M-4MB designs.

\begin{figure*} %[t]
\begin{minipage}{0.45\textwidth}
\centering
 \psfig{figure=figures/energy-impact.eps, width=3in, height=1.3in}
 \caption{\label{fig:revive-dram} Energy impacts of Revive refresh scheme as a percentage of DRAM-style refresh}
\end{minipage}
\hfill %\vfill
\begin{minipage}{0.48\textwidth}
 \centering
 \psfig{figure=figures/confi.eps, width=3in, height=1.3in}
 \caption{\label{fig:confi} 95\% Confidence Intervals of diminishing blocks for each way}
\end{minipage}
\end{figure*}



Figure~\ref{fig:revive-dram} shows the energy comparisons of our revive refresh scheme as a percentage of
DRAM-style refresh scheme proposed in~\cite{STTRAM:HPCA11}. Since, our scheme only refreshes a small
number of required blocks as opposed to all the LLC blocks, on an average, it consumes only 4\% of energy compared
to the DRAM-style refresh. For the same reason, our scheme results in less than 90\% of performance penalty taking into our observation that about 10\% of the total diminishing blocks need to written back to the main memory. Although our scheme has 4\% buffer area overhead, the significant refreshing benefits obtained justifies our design. Moreover, refreshing benefits will diminish radically if it is done at {\it $\mu$s} level~\cite{STTRAM:HPCA11}, and hence our $ms$ proposal helps in sustaining the energy benefits.


\subsection{Sensitivity Analysis}
\noindent\textbf{Sensitivity to number of buffer entries:}
The number of buffer entries can affect the performance of the Revived-M-4MB scheme in two ways: increasing the
buffer size will accommodate more diminishing blocks at a particular instance and decreasing the buffer size will
lead to more buffer overflows. Increasing the buffer size, leads to fewer buffer overflows,
but this reduction comes at a cost of increase in buffer area and consequent revival overheads. Decreasing the buffer size, eventually leads to additional write backs (discussed in Section ~\ref{sec:implementation}).

%\begin{wrapfigure}{l}{0.50\textwidth}
%\centering
% \psfig{figure=figures/confi.eps, width=1.6in, height=0.4\textwidth}
% \caption{\label{fig:confi} \scriptsize \bf 95\% Confidence Intervals of Diminished Blocks for each Way}
%\end{wrapfigure}




To find the optimal buffer size, we calculate the 95\% Confidence Intervals (CIs) for the cumulative distribution of
diminishing blocks per bank. This is shown in Figure~\ref{fig:confi}. We observe that, for the first 8 MRU slots, the
mean value of the buffer entries is 1900 blocks, which corresponds to a 3\% area overhead per L2 cache bank.
Upper limit of the 95\% CI corresponds to 2500 blocks,  which represents 4\% area overhead per L2 cache bank.
The lower limit of 95\% CI corresponds to 1300 blocks (2\% area overhead).

Figure~\ref{fig:buf-mru}(a) shows the normalized speedup to 1300 blocks, with a subset of PARSEC applications by varying the number of buffer entries. On an average, varying the number of buffer entries from 1900 to 2500,results in only 1\% speedup improvement.
Hence, in all our results, we used 1900 buffer entries (resulting in a 3\% area overhead)with the best possible performance per area-overhead.

\begin{figure*} [t]
\centering
\begin{tabular}{cc}
 \psfig{figure=figures/buffer.eps, width=3.2in, height=1.5in} &
\psfig{figure=figures/slots.eps, width=3.2in, height=1.5in} \\
(a) Buffer Entries & (b) MRU Slots.
\end{tabular}
 \caption{Speedup as a function of number of Buffer Entries and MRU Slots }
\label{fig:buf-mru}
\end{figure*}


\noindent\textbf{Sensitivity to number of MRU slots:}
Choosing the optimal number of MRU slots to buffer was discussed briefly in Section~\ref{sec:implementation}.
Figure~\ref{fig:cdf} shows the mean cumulative distribution of diminishing blocks per
way across all PARSEC benchmarks. We find that after 8 MRU slots, the number of diminishing blocks
saturates, which suggests that  the optimal number of MRU slots is 8. Figure~\ref{fig:buf-mru}(b) shows
speedup of a subset of PARSEC applications along with the average across 12 PARSEC applications, with varying
number of MRU slots. Buffer size is kept constant at 1900 per bank. We see degradation in performance when we decrease
the number of slots from 8 to 4, since buffering 8 MRU slots would have covered more frequently used blocks and hence, reducing
write backs of useful blocks.
We also see degradation in performance by increasing the number of slots from 8 to 12.
This is because, with a constant buffer
size, 12 MRU slots increase the probability of buffer overflows, which increases the write backs leading to performance degradation.


